DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
In statistical parlance, data is a plural word referring to a collection of numbers or other pieces of information to which meaning has been attached
— Utts (2014)
An observation is an individual that we measure or categorise data about
A value is a singular piece of data about an observation
A variable is a collection of one type of data collected on all observations
… we measure or categorise data about
Numeric Quantitative
Data that can be described with numbers
Categorical Qualitative, Factor
Data that is best described with text
In DATAX121, most, if not all, datasets follow tidy data principles
There are three interrelated rules that make a dataset tidy:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.
— Wickham et al. (in press)
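As a minimal sketch of these rules, the data frame below stores one observation per row and one variable per column. The name `tidy.df`, its column names, and its values are made up for illustration and are not taken from any DATAX121 dataset.

```r
# Hypothetical data, for illustration only
tidy.df <- data.frame(
  id         = c(1L, 2L, 3L),                     # identifies each observation
  height_cm  = c(172.0, 165.5, 180.2),            # a numeric variable
  eye_colour = factor(c("brown", "blue", "brown")) # a categorical variable
)

nrow(tidy.df)  # number of observations (rows)
ncol(tidy.df)  # number of variables (columns)
str(tidy.df)   # the type of each variable
```

Each cell holds exactly one value, e.g. `tidy.df$height_cm[2]` is the height of the second observation.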
Synthetic sample data based on real data from the June quarter 2011 NZ Income Survey. The survey was an annual snapshot to produce income statistics on New Zealanders aged 15 and over based on a representative sample of the population.
| Variables | |
|---|---|
| ethnicity | A factor denoting the ethnicity with 6 levels |
| region | A factor denoting the region of residence |
| gender | A factor denoting the gender, male or female |
| agegp | A factor denoting the five-year age band. Note that the value 65 describes an individual aged 65 or older |
| qualification | A factor denoting the highest qualification level with 5 levels |
| occupation | A factor denoting the category of the main income source with 10 levels |
| hours | A number denoting the weekly hours worked from all wages and salary jobs excluding self-employment |
| income | A number denoting gross weekly income from all sources ($) |
The simple graph (plot) has brought more information to the data analyst’s mind than any other device.
— John Tukey
The goal of EDA is simply to explore our data with visualisations and descriptive statistics
In DATAX121, we cover the basic suite of EDA tools to describe features of the data
stripplot( ~ income, data = nzis.df,
           xlab = "Gross Weekly Income ($)",
           main = "NZer's gross weekly income snapshot in 2011")
Arguments
xlab takes text to label the horizontal (\(x\)) axis
main takes text to label the title of the plot
The variable is plotted as-is on a number line
A function from the lattice R package was used to create this plot
One issue with this plot for CS 1.1 is overplotting—we have values that are plotted on top of each other
The gross weekly income seems to be centred at about $2,500. Most of the data seems to be between -$1,000 and $10,000
stripplot( ~ income, data = nzis.df, jitter.data = TRUE,
           factor = 10, xlab = "Gross Weekly Income ($)",
           main = "NZer's gross weekly income snapshot in 2011")
Arguments
jitter.data takes a value of either TRUE or FALSE
factor takes a number to determine the extent of the jitter
The variable is plotted on a number line and the values are randomly spread along the other axis
Jitter helps avoid overplotting for datasets like CS 1.1. Also, the default extent of the jitter may need to be tweaked per dataset
Jitter also visualises the density of values
The gross weekly income seems to be centred at about $1,500. Most of the data seems to be between -$500 and $5,000. The data is skewed to the right
bwplot( ~ income, data = nzis.df, pch = "|",
        xlab = "Gross Weekly Income ($)",
        main = "NZer's gross weekly income snapshot in 2011")
Arguments
pch = "|" plots the median as a line instead of a dot
Add coef = 0 to plot only the “whiskers” without the outliers
The variable is summarised with five descriptive statistics, with outliers (by default), then those features are plotted on a number line
Box plots avoid overplotting by plotting summarised data (and outliers). More on this later!
However, some features of the distribution of the data are hidden
The median gross weekly income seems to be about $500. The central 50% of the data seems to be between $0 and $1,000. The data is clearly right-skewed
histogram( ~ income, data = nzis.df, nint = 50, type = "count",
           xlab = "Gross Weekly Income ($)",
           main = "NZer's gross weekly income snapshot in 2011")
Arguments
nint takes a number to determine the number of intervals
type takes a value of either "count" or "percent"
The variable is visualised as a set of bars whose widths are equally-sized intervals, but the heights are determined by the number of values within the interval
Histograms avoid the overplotting issue by summarising the frequency of values based on equally-sized intervals
The default number of intervals may need to be tweaked per dataset. Also, histograms may not be suitable for “small” datasets
The gross weekly income seems to be centred at about $1,000. Most of the data seems to be between $0 and $2,000
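The counting behind a histogram can be sketched in base R. This is an illustration only: it uses `faithful`, a dataset that ships with R, and an arbitrary interval width of 5 minutes. The intervals that lattice's `histogram()` chooses for a given `nint` may differ.

```r
# faithful ships with R; waiting is measured in minutes
x <- faithful$waiting

# Equally-sized intervals of width 5 (an arbitrary choice for this sketch)
breaks <- seq(40, 100, by = 5)

# Assign each value to an interval, then count the values in each interval
counts <- table(cut(x, breaks = breaks))
counts

# Every value falls in exactly one interval, so the counts sum to n
sum(counts) == length(x)
```

The height of each bar in a count histogram is one of these frequencies; changing `by` changes the number of intervals and therefore the look of the plot.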
Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA
| Variables | |
|---|---|
| eruptions | A number denoting the eruption time (in minutes) |
| waiting | A number denoting the waiting time to the next eruption (in minutes) |
stripplot( ~ waiting, data = faithful, jitter.data = TRUE,
           factor = 3, xlab = "Waiting time (minutes)",
           main = "Waiting time between eruptions") |>
  print(split = c(1, 1, 1, 2), more = TRUE)
bwplot( ~ waiting, data = faithful, pch = "|", coef = 0,
        xlab = "Waiting time (minutes)") |>
  print(split = c(1, 2, 1, 2))

stripplot( ~ waiting, data = faithful, jitter.data = TRUE,
           factor = 3, xlab = "Waiting time (minutes)",
           main = "Waiting time between eruptions") |>
  print(split = c(1, 1, 1, 2), more = TRUE)
histogram( ~ waiting, data = faithful, nint = 25,
           xlab = "Waiting time (minutes)") |>
  print(split = c(1, 2, 1, 2))

Sample data from a similar woodblock exercise used in the first lecture. The exercise aimed to estimate the average block weight using only a sample of blocks.
| Variables | |
|---|---|
| Block.ID | An integer from 1 to 100 denoting the block’s identification number |
| Weight | A number denoting the weight of the block (grams) |
A measure of centre that is often described as the balancing point of the variable
Let \(x_i\) be the \(i\)th value and \(n\) be the total number of observations. Then, the sample mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum^n_{i=1} x_i}{n} \]
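A small sketch of the formula above in R. The values in `x` are made up for illustration and are not the woodblock data.

```r
# Hypothetical block weights (grams), for illustration only
x <- c(10, 20, 30, 60)

n <- length(x)        # total number of observations
x.bar <- sum(x) / n   # sum of the values divided by n
x.bar

# The built-in function gives the same answer
mean(x)
```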
A measure of spread that is mathematically associated with the mean
Let \(x_i\) be the \(i\)th value, \(\bar{x}\) be the sample mean, and \(n\) be the total number of observations. Then, the sample standard deviation is
\[ \begin{aligned} s &= \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} \\ &= \cdots \\ &= \sqrt{\frac{\sum^n_{i=1} (x_i)^2 - n\bar{x}^2}{n - 1}} \end{aligned} \]
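Both forms of the formula can be checked numerically. The sketch below uses made-up values (not the woodblock data) and compares the results with R's `sd()`, which also divides by \(n - 1\).

```r
# Hypothetical values, for illustration only
x <- c(10, 20, 30, 60)
n <- length(x)
x.bar <- mean(x)

# First form: squared deviations from the sample mean
s.def <- sqrt(sum((x - x.bar)^2) / (n - 1))

# Last form: sum of squares minus n times the squared mean
s.short <- sqrt((sum(x^2) - n * x.bar^2) / (n - 1))

# All three values should agree (up to rounding error)
c(s.def, s.short, sd(x))
```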
If the data has one mode (unimodal) and is relatively symmetrical, then approximately:
A measure of centre that is often described as the middle value (50th percentile) of the variable
Let \(n\) be the total number of observations. Then, the sample median, \(m\), can be determined by
A measure of spread that describes the width of the variable
The range is
The difference between the observations with the largest and smallest values
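The median and the range can be sketched in R with made-up values. The median convention assumed here is the usual one, which is what R's `median()` uses: sort the values, then take the middle value when \(n\) is odd, or the average of the two middle values when \(n\) is even.

```r
# Hypothetical values, for illustration only
x.odd  <- c(7, 1, 5)      # n = 3, sorted: 1, 5, 7
x.even <- c(7, 1, 5, 3)   # n = 4, sorted: 1, 3, 5, 7

median(x.odd)    # the middle value of the sorted data
median(x.even)   # the average of the two middle values, 3 and 5

# range() returns the smallest and largest values; diff() subtracts them
diff(range(x.even))
```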
A measure of spread that describes the width of the central 50% of the variable
The interquartile range, \(IQR\), can be determined by
1st Qu., which could be the median of the lower 50% of the data
3rd Qu., which could be the median of the upper 50% of the data
Recall that box plots visualise “outliers” by default
The (approximate) rules used by most software and packages are:
bwplot( ~ income, data = nzis.df, pch = "|",
        xlab = "Gross Weekly Income ($)",
        xlim = c(-2500, 2500),
        main = "NZer's gross weekly income snapshot in 2011")
-843.25, 2046.75
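A sketch of how the interquartile range and a pair of outlier limits might be computed by hand. It assumes the common 1.5 × IQR convention (1.5 is also the default `coef` in R's `boxplot.stats()`) and uses the built-in `faithful` data rather than `nzis.df`. R offers several definitions of a quartile, so other software may give slightly different limits.

```r
x <- faithful$waiting

# 1st and 3rd quartiles (R's default quantile definition)
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])   # width of the central 50% of the data

# Assumed convention: limits sit 1.5 IQRs beyond the quartiles
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr
c(lower, upper)

# Values beyond the limits would be flagged as potential outliers
x[x < lower | x > upper]
```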
Centre
The “typical size” of the data, e.g. the sample mean & median
Spread
The “variability” of the data, e.g. the sample standard deviation, range, & interquartile range
Outliers
An observation whose value is notably distinct from other values in the data
Cluster
A distinct group of observations—see CS 1.2
Shape (Distribution)
The form of the data, e.g. U-shaped or bell-shaped
Mode (Distribution)
The “frequent value(s)” of the data, e.g. the peaks of a histogram
Symmetrical (Distribution)
A distribution where the two sides approximately match when folded on a vertical centre line
Skewed to the left (Distribution)
A distribution where the data piles up on the right and the tail extends relatively far out to the left
Skewed to the right (Distribution)
A distribution where the data piles up on the left and the tail extends relatively far out to the right
Weight and length measures of 844 snapper, Pagrus auratus, caught in the Hauraki Gulf, near Auckland, New Zealand.
| Variables | |
|---|---|
| len | A number denoting the fork length of the fish (centimetres) |
| wgt | A number denoting the weight of the fish (kilograms) |
xyplot(len ~ wgt, data = snapper.df,
       main = "Scatter plot of snapper fork length vs weight",
       xlab = "Weight (kg)", ylab = "Fork length (cm)")
Arguments
ylab takes text to label the vertical (\(y\)) axis
The location of each observation is determined by the value of the two visualised variables
Scatter plots help us describe the relationship between two variables
A simple description addresses the direction (positive or negative) and type of relationship (linear, non-linear, or “none”)
There is a positive non-linear relationship between the fork length and weight of snapper
The extra arguments, type = c("p", "r"), col.line = "black", and lwd = 2 add a distinct best-fit line to the scatter plot
The best-fit line is a useful aid to help determine if the relationship between two numeric variables is linear
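A sketch of a scatter plot with these extra arguments. Because `snapper.df` is not bundled with R, the example uses the built-in `faithful` data instead; the axis labels and title are written for that dataset.

```r
library(lattice)

# type = c("p", "r") draws the points ("p") and a least-squares line ("r")
p <- xyplot(waiting ~ eruptions, data = faithful,
            type = c("p", "r"), col.line = "black", lwd = 2,
            xlab = "Eruption time (minutes)",
            ylab = "Waiting time (minutes)",
            main = "Waiting time vs eruption time")
print(p)
```

If the points curve away from the line in a systematic way, that suggests the relationship is non-linear.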
A measure of the strength and direction of a linear association between two numeric variables
If it was appropriate for the snapper data, then \(r=0.95\) (2 dp)
\(r\) helps us describe the strength of a linear association… Why not the strength of a linear relationship?
Values of \(r\) close to \(+1\) or \(-1\) show a strong linear association, while values of \(r\) close to \(0\) show little or no linear association
\[ r = \frac{\sum_{i=1}^n (x_i \cdot y_i) - n \cdot \bar{x} \cdot \bar{y}}{(n - 1) \cdot s_x \cdot s_y}. \]
\(\text{Let:}\)
\(\bullet ~ n ~ \text{be total number of observations}\)
\(\bullet ~ x_i ~ \text{and} ~ y_i ~ \text{be the} ~ i^\text{th} ~\text{observation's values for the} ~ x ~ \text{and} ~ y ~ \text{variables}\)
\(\bullet ~ \bar{x} ~ \text{and} ~ \bar{y} ~ \text{be the sample means of the} ~ x ~ \text{and} ~ y ~ \text{variables}\)
\(\bullet ~ s_{x} ~ \text{and} ~ s_{y} ~ \text{be the sample standard deviations of the} ~ x ~ \text{and} ~ y ~ \text{variables}\)
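The formula can be checked against R's `cor()`. The sketch below uses the built-in `faithful` data rather than the snapper data.

```r
x <- faithful$eruptions
y <- faithful$waiting
n <- length(x)

# Numerator: sum of the products minus n times the product of the means
# Denominator: (n - 1) times the two sample standard deviations
r.formula <- (sum(x * y) - n * mean(x) * mean(y)) /
  ((n - 1) * sd(x) * sd(y))

# Both values should agree (up to rounding error)
c(r.formula, cor(x, y))
```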
A snapshot of Wordle guess distributions from David and his Wordle-obsessed friends.
| Variables | |
|---|---|
| Count | An integer denoting the frequency of Guesses |
| Initials | A factor denoting whose Wordle guess distribution it is with 5 levels |
| Guesses | A factor denoting how many guesses it took to complete the daily Wordle (as you lose if your 6th guess is incorrect) with 7 levels |
xtabs(Count ~ Guesses, data = wordle.df) |>
  as.data.frame() |>
  barchart(Freq ~ Guesses, data = _, origin = 0, xlab = "Guesses",
           ylab = "Counts", main = "Wordle guess distribution")
Arguments
origin = 0 ensures that the bars start at 0 (!?!)
The variable is visualised as a set of bars, one for each level, and the height of each bar is the frequency of the level
The term frequency is used to describe the number of observations with that specific level (category)
Producing a bar plot of counts with tidy data involves a bit more R code for a sensible bar plot
The most frequent number of guesses required for a Wordle game was four guesses followed by five guesses
xtabs(Count ~ Guesses, data = wordle.df) |>
  proportions() |> # Converts the frequencies into proportions
  as.data.frame() |>
  barchart(Freq ~ Guesses, data = _, origin = 0, xlab = "Guesses",
           ylab = "Proportions", main = "Wordle guess distribution")
Arguments
The addition of the proportions() line tells R to do the necessary proportion calculations
The variable is visualised as a set of bars, one for each level, and the height of each bar is the proportion of all observations with that level
Bar plots of counts and of proportions look identical for a single categorical variable; only the scale of the vertical axis differs
However, the frequencies and \(n\) that were used to calculate the proportions are now hidden
The benefit of visualising proportions instead of counts is more evident for two categorical variables
The most frequent number of guesses required for a Wordle game was four guesses followed by five guesses \((n = 1137)\)
Data from a sample of 200 patients following admission to an adult intensive care unit (ICU) in the United States of America.
| Variables | |
|---|---|
| Status | A factor denoting whether the patient lived or died |
| Sex | A factor denoting the patient’s sex, male or female |
| Race | A factor denoting the patient’s race, white, black or other |
| Infection | A factor denoting whether an infection was involved, yes or no |
| Previous | A factor denoting whether the patient has been admitted to ICU within the last 6 months |
| Type | A factor denoting the type of ICU admission, elective or emergency |
| Fracture | A factor denoting whether a fractured bone was involved, yes or no |
Status Sex Race Infection
Length:200 Length:200 Length:200 Length:200
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Previous Type Fracture
Length:200 Length:200 Length:200
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
The bar plots described prior are visualised from frequency tables
In R, the construction of frequency tables… depends on the data!
If the data already has a numeric variable for the counts of each level of the categorical variable, e.g. CS 1.4:
A measure that describes the number of observations within a level of a categorical variable as a number between \(0\) and \(1\) (inclusive)
Let \(n\) be the total number of observations. Then, the sample proportion for some category is
\[ \widehat{p} = \frac{\text{Number in that level}}{n} \]
Proportions as defined in the previous slide are interchangeable with percentages
Note that a percentage must be written in % (percent) units
Let \(\widehat{p}\) be the sample proportion for some category. Then, the corresponding percentage is
\[ \text{Percentage} = \left(100 \times \widehat{p}\right)\!\% \]
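A sketch of both calculations in R. It uses the built-in `mtcars` data (number of cylinders, `cyl`) purely as a stand-in for a categorical variable, since the course datasets are not bundled with R.

```r
# Frequency table for one categorical variable
counts <- table(mtcars$cyl)
n <- sum(counts)   # total number of observations

# Sample proportion for each level: number in that level divided by n
p.hat <- counts / n
p.hat

# proportions() does the same calculation
proportions(counts)

# The corresponding percentages
100 * p.hat
```

The proportions across all levels sum to one, and the percentages sum to 100%.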
xtabs(Count ~ Guesses, data = wordle.df) |>
  proportions() |> # Converts the frequencies into proportions
  as.data.frame() |>
  barchart(100 * Freq ~ Guesses, data = _, origin = 0, xlab = "Guesses",
           ylab = "Percentage (%)", main = "Wordle guess distribution")
The two variables are visualised as a set of bars, one for each level combination, and the height of each bar is the frequency of the level combination
Describing relationships between two categorical variables is tricky
The frequencies of each level combination can lead to inappropriate conclusions if the levels of the “by” categorical variable do not have similar frequencies
Arguments
auto.key = list(title = "Race", space = "right") includes a legend on the right-hand side of the plot titled “Race”
xtabs( ~ Status + Race, data = icu.df) |>
  proportions("Race") |>
  as.data.frame() |>
  barchart(Freq ~ Status, groups = Race, data = _, origin = 0,
           main = "Status distribution by Race",
           xlab = "Status", ylab = "Proportion",
           auto.key = list(title = "Race", space = "right"))
The two variables are visualised as a set of bars, one for each level combination, and the height of each bar is the proportion within a level of a variable given the level of another variable
Proportions prevent us from quantifying a relationship that is only pronounced due to the frequencies
Note that the proportions from the same coloured bars sum to one, which is why the “X distribution by Y” phrase is used in the plot’s title
It seems that a greater proportion of “black” patients who were admitted into ICU lived compared to “other” and “white” patients
Arguments
proportions("Race") tells R to calculate conditional proportions of Status for each level of Race
groups = Race splits the bars, side-by-side, for Status by the levels of Race
The bar plots described in this section are visualised from two-way tables
In R, the construction of two-way tables also depends on the data!
If the data already has a numeric variable for the counts of each level combination of the two categorical variables, e.g. CS 1.4:
A measure that describes the number of observations within a level of a categorical variable given the level of another categorical variable as a number between \(0\) and \(1\) (inclusive)
A proportion calculated this way can also be interpreted as a percentage
Let \(n_\bullet\) be the total number of observations for a category level. Then, the sample proportion for some other category given the category level is
\[ \widehat{p}_\bullet = \frac{\text{Number in both levels}}{n_\bullet} \]
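A sketch of this calculation in R. The built-in `mtcars` data stands in for the ICU data, with number of cylinders (`cyl`) and transmission type (`am`, coded 0 or 1) playing the roles of the two categorical variables.

```r
# Two-way table of counts: rows are cyl, columns are am
two.way <- xtabs( ~ cyl + am, data = mtcars)
two.way

# Conditional proportions of cyl given each level of am
cond <- proportions(two.way, "am")
cond

# The same calculation by hand for one cell:
# number with cyl = 4 and am = 0, divided by the total with am = 0
by.hand <- two.way["4", "0"] / sum(two.way[, "0"])
by.hand

# Within each level of am, the proportions sum to one
colSums(cond)
```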
xtabs( ~ Sex + Race, data = icu.df) |>
  proportions("Race") |>
  as.data.frame() |>
  barchart(100 * Freq ~ Sex, groups = Race, data = _, origin = 0,
           main = "Sex distribution by Race",
           xlab = "Sex", ylab = "Percentage (%)",
           auto.key = list(title = "Race", space = "right"))
The concepts of a proportion and a probability are quite distinct. A proportion is a partial description of a real population—a form of summary. Probabilities tell us about the chances of something happening in a random experiment. The fact that proportions are numerically identical to probabilities for a real population under the experiment “choose a unit at random,” however, means that we can use the probability notation and any formulas derived for manipulating probabilities to solve problems involving proportions as well.
— Wild & Seber (2000)